Form Document Representation and Identification

Author

  • Volkan Atalay
Abstract

A form processing system aims to extract meaningful data from a form document for office automation [1, 2, 3]. The main interest is to extract the user filled-in data, which is considered to be meaningful. However, in order to perform such a task, the structure of the form should be known in advance. The form structure can be obtained by processing a blank model form on which no user filled-in data exists. The model form may be stored in a database, and when an instance form is presented to the system, its type can be identified by matching it with one of the model forms in the database. Meaningful data can then be automatically extracted. A form document can easily be represented by its image. However, such physical information may not be appropriate, since the form or its image may be geometrically modified or distorted by printing or digitization. On the other hand, the logical structure represents the semantics of the form, and the same logical structure can be formatted in a variety of physical layouts. The geometrical structure should be mapped to a logical structure by considering the logical relations. In this paper, our aim is to develop a logical representation for form documents and to use the representation for identification. We propose a heuristic algorithm to transform the geometric structure of a form into a logical structure by using the horizontal and vertical lines that exist on the form. The logical structure is represented by a hierarchical tree. The hierarchy of the tree corresponds to the hierarchy of the blocks in the form document. The proposed representation is close to the human point of view of the form structure. Logically identical forms have the same hierarchical tree structure. Geometrical modifications and slight variations of a form are also handled by the proposed representation. A hierarchical structure is proposed to represent the logical layout of a form by using horizontal and vertical lines.
The main aim is to partition the form into blocks, which can be further divided until cells (the smallest blocks) are reached. The partitioning results in a tree where the root is the form itself and the leaf nodes correspond to the cells. The heuristic behind the approach is that blocks which contain similar information are grouped together, and these groups of blocks are separated by lines that are relatively longer than the others. Such lines …
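The partitioning idea described above can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the names `Line`, `Node`, `partition`, and `signature` are hypothetical, and the rule "split a block along the lines that cross it completely" is an assumed simplification of the "relatively longer lines" heuristic. Comparing geometry-free tree signatures is likewise only one assumed way to check whether two forms are logically the same.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# A block is an axis-aligned rectangle: (x0, y0, x1, y1).
Rect = Tuple[int, int, int, int]


@dataclass(frozen=True)
class Line:
    horizontal: bool  # True: horizontal line at y = pos; False: vertical line at x = pos
    pos: int          # coordinate perpendicular to the line's direction
    start: int        # extent of the line along its own direction
    end: int


@dataclass
class Node:
    rect: Rect
    children: List["Node"] = field(default_factory=list)


def spanning_cuts(rect: Rect, lines: List[Line], horizontal: bool) -> List[int]:
    """Positions of lines strictly inside rect that cross it from side to side."""
    x0, y0, x1, y1 = rect
    lo, hi = (y0, y1) if horizontal else (x0, x1)  # allowed range for the cut position
    a, b = (x0, x1) if horizontal else (y0, y1)    # extent the line has to cover
    return sorted({l.pos for l in lines
                   if l.horizontal == horizontal
                   and lo < l.pos < hi
                   and l.start <= a and l.end >= b})


def partition(rect: Rect, lines: List[Line]) -> Node:
    """Recursively split a block; blocks with no crossing line become cells (leaves)."""
    node = Node(rect)
    x0, y0, x1, y1 = rect
    for horizontal in (True, False):  # assumed order: try horizontal cuts first
        cuts = spanning_cuts(rect, lines, horizontal)
        if cuts:
            lo, hi = (y0, y1) if horizontal else (x0, x1)
            bounds = [lo] + cuts + [hi]
            for s, e in zip(bounds, bounds[1:]):
                child = (x0, s, x1, e) if horizontal else (s, y0, e, y1)
                node.children.append(partition(child, lines))
            break
    return node


def signature(node: Node) -> tuple:
    """Nested tuple that keeps only the shape of the tree, not the coordinates."""
    return tuple(signature(c) for c in node.children)


# A header row above two side-by-side cells.
model = partition((0, 0, 100, 100),
                  [Line(True, 20, 0, 100), Line(False, 50, 20, 100)])
print(signature(model))  # ((), ((), ()))
```

Because the signature discards coordinates, a stretched or rescaled copy of the same layout yields the same nested tuple, which mirrors the claim that logically identical forms share a tree structure. Real scanned forms would additionally need line detection and some tolerance for broken or skewed lines, which this sketch does not attempt.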

Similar resources

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...

A New Document Embedding Method for News Classification

Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. A great deal of news spreads on the web. A text classifier can categorize news automatically, which facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...

A Comparative Study of Representation in Signing Commercial Documents (Bill of Exchange, Promissory Note, and Cheque)

With respect to appointing a representative for the issuance of drafts and cheques under Article 227 of the Commerce Law and Article 19 on drawing cheques, the following questions have always been present: a) whether this representation exists only at the time of signing a document or is also present at other stages such as endorsement and assurance; b) Whether the responsibility of s...

The Representation of Social Actors in the Graduate Employability Issue: Online News and the Government Document

This paper presents the first part of a larger study on the issue of graduate employability in Malaysia as construed in public discourse in English, a language of power in Malaysia. The term employability itself has many definitions depending on the requirements of government and industry, and in the case of Malaysia, the English-language ability of graduates is inseparable from graduate employ...

A Form Document Image Parser

Forms are used extensively as input to many information management systems. An automatic way of processing a form should be preferred over the traditional methods due to high volume and numerous kinds of form documents. A system that minimizes operator error, cost and user interaction has to be developed. This paper proposes a system for automatically parsing form documents that have textual an...

Publication date: 2007